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1. INTRODUCTION 

We appreciate having the opportunity to com- 
ment on the well- motivated, highly informative and 
carefully constructed article by Imai, King and Nail 
(IKN) . There has been a great deal of confusion over 
the years about the issue of pair-matching, often due 
to a conflation of the implications of design versus 
analysis choice. This article sheds light on the de- 
bate and offers a set of helpful alternative analysis 
choices. 

Our discussion does not take issue with IKN's 
provocative assertion that one should pair-match in 
cluster randomized trials "whenever feasible." In- 
stead we will explore the trade-offs between using 
the inferential framework advocated by IKN versus 
fitting fairly standard multilevel models (see, for in- 
stance, Gelman and Hill, 2007). 

The IKN design-based treatment effect estimators 
have the advantage of being simple to calculate and 
having better statistical properties in general than 
the harmonic mean estimator that IKN view to be 
the most standard estimator in this setting. Vari- 
ance estimators for SATE and CATE are not iden- 
tified, but that is a function of not making the as- 
sumption of constant treatment effects, which we 
find realistic. IKN do provide upper bound variance 
estimators for these quantities of interest. Perhaps 
the biggest drawback to these methods is that they 
are not flexible if it is necessary or helpful to extend 
the framework to accommodate additional compli- 
cations or information. 
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The strength of multilevel models in this estima- 
tion setting is the flexibility to build in complex- 
ity that could provide us with additional informa- 
tion, increase our precision, or sometimes even re- 
duce bias (for instance, when correcting for "bro- 
ken" randomization). As an example, while the IKN 
variance estimators accommodate varying treatment 
effects, the multilevel model provides a framework to 
actually examine these pair-to-pair differences. The 
model can also be extended to allow treatment ef- 
fects to vary over covariate-defined subgroups which 
has the potential to substantially increase our un- 
derstanding of effect transmission. Conditioning on 
pre-treatment covariates can also help to increase 
precision (and even reduce bias in situations where 
the randomization has been less pristine). Moreover, 
not only can multilevel models include covariates 
and random treatment effects quite readily, but the 
need for such terms can be evaluated statistically. 

A further example is the ability of models to ac- 
commodate missing data at the individual level (rather 
than entire clusters being missing due to group-level 
noncompliance or attrition which IKN address) . This 
can be naturally incorporated into a model-based 
framework as well; it's unclear how the IKN frame- 
work would handle this complication. 

Of course, these advantages come at the cost of 
making some modeling assumptions. IKN go so far 
as to claim that these approaches "violate the very 
purpose of experimental work which goes to great 
lengths and expense to avoid these types of assump- 
tions." However, the primary purpose of experimen- 
tal work is to avoid the untestable assumption of 
ignorability (or strong ignorability) that is so dif- 
ficult to avoid in observational work. While it is 
true that we do not need to build models post- 
randomization in order to estimate treatment ef- 
fects, this can hardly be viewed as the goal of ran- 
domized experiments. In fact, randomization actu- 
ally increases robustness to model-misspecification, 
creating a safer climate within which to build models 
than would otherwise exist. Moreover, the paramet- 
ric assumptions we make with a multilevel model 
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are testable, for instance, using graphical regression 
diagnostics. 

It could be argued that multilevel models have 
the disadvantage of being more complicated to fit. 
However with the capabilities of current standard 
statistical software the level of technical expertise 
required to fit such models is well within the reach 
of most applied researchers today. 

2. SIMPLE MULTILEVEL MODELS FOR 
ESTIMATING TREATMENT EFFECTS 

First we lay out a few simple multilevel models for 
estimating treatment effects in the setting of a pair- 
matched cluster-randomized experiment. Clearly we 
have not exhausted all possibilities, but the models 
we discuss have the advantage of being relatively 
simple, easily fit with standard software, and readily 
expandable to more complex settings. 

A very simple model for observation % in cluster j 
and pair k is 

(1) Yijk = r Tjk + a k + £ ijk, 

with a common treatment effect, r, as well as vary- 
ing intercepts ak, where ak ~ N(ao, Tjk is a 
treatment indicator, and j € {1,2} while k £ {1,2, 
. . . , K}. As is common in multilevel models of this 
sort, the random terms are assumed independent of 
the predictors, an assumption which is particularly 
defensible in the context of a randomized experi- 
ment, and E{jk ~ iV(0, erf ). 

A simple adjustment, allowing r to vary by pair, 
yields a model for heterogeneous treatment effects: 

(2) Yij k = T kTjk + «fc + Sijk , 

with Tfc ~ N(to,(?t)- We- typically do not want to 
assume that ak and Tk are independent, therefore 
there is a covariance term in the model, a ar , and 
the pair (a^Tfc) are assumed bivariate normal. 

A word of caution is warranted with regard to the 
Tk ■ These parameters cannot be interpreted causally 
except in the special case in which we know that 
clusters have been perfectly matched on their po- 
tential outcomes (which is implausible in practice). 
Otherwise, we cannot separately identify variation 
caused by within-pair cluster mismatch from vari- 
ation that is due to treatment effects that actually 
vary across pairs. Nonetheless, allowing r to vary is 
important because it allows us to test for this extra 
source of heterogeneity (whatever the true source of 
the heterogeneity). To the extent that we can sat- 
isfy ourselves that we have indeed obtained close 



matches (mostly likely after having also conditioned 
on some highly predictive pre-treatment variables), 
we can move toward a causal interpretation of these 
quantities. However, if our goal is to explore treat- 
ment effect moderation, we're probably better off 
doing so by (additionally) allowing the treatment 
effects to vary by covariate levels. 

We can augment either of these models by includ- 
ing cluster-level covariates, Xj. This is particularly 
helpful when we are unable to perfectly match clus- 
ters. Here we focus on inclusion of covariates purely 
for increasing precision (not to explore treatment 
effect moderation). In this case we add a cluster- 
specific level to the model, as in 

Yijk — T~kTjk + (f>jk ~\~ &ijk, 

(3) 

4>jk — XjkP + ak, 

where (j>jk captures cluster-specific variation that de- 
pends on both Xjk and our varying pair intercepts, 

Oik- 

3. EXAMINING THE IMPLICATIONS OF 
IMPERFECT MATCHING AND TREATMENT 
EFFECT HETEROGENEITY 

We explore the implications of imperfect matching 
and the presence of treatment effect heterogeneity 
through a small set of simulations. Our primary sim- 
ulations vary the following components: (i) cluster 
size perfectly or imperfectly matched, (ii) cluster- 
specific SATE perfectly or imperfectly matched and 
(iii) treatment effect fixed or varying. Simulations 
are repeated 100 times for each scenario. 

The data generated in each simulation are fit using 
the two multilevel models laid out in equations (1) 
and (2) above (we'll refer to them as MLM1 and 
MLM2, for the constant and varying treatment ef- 
fect models, respectively). To represent an analy- 
sis option that would be easy to use by an applied 
researcher we fit the multilevel models using the 
lmer command (package is lme4) in R (R Develop- 
ment Core Team, 2008; very similar packages exist 
in Stata, SAS and SPSS, among others) and used the 
standard estimates. In theory, however, one could 
fit these models using a more flexible package such 
as BUGS or JAGS in which case it would be triv- 
ial to reweight the in order to make inferences 
about any of a wide range of different quantities of 
interest. For comparison purposes we fit the IKN 
SATE estimator (to mirror the multilevel model's 
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implicit weighting scheme by pair sample size) us- 
ing the upper bound variance estimate to demon- 
strate the relationship between this bound and the 
uncertainty estimate in MLM2. Since we can com- 
pare nested multilevel models using likelihood ratio 
tests (LRTs), we also evaluate whether the model 
detects evidence of variation in treatment effects. 
We then extend the simulations to incorporate a 
cluster-level covariate as described in more detail be- 
low, and fit the multilevel model described in equa- 
tion (3) above. 

We simulate matched-pair cluster randomized ex- 
periments in a manner similar to the IKN simu- 
lations with the notable difference that we do not 
force cluster-specific SATE to be perfectly matched 
within pair (as described below). The number of 
pairs in each simulation is 30. Throughout our sim- 
ulations, the average (or fixed) cluster size is 50. 
When cluster size is imperfectly matched across clus- 
ters, we allow it to vary based on multinomial draws 
from cluster labels, each equally likely, drawing a 
sufficient sample so that the expected size of any 
cluster is 50. Using this strategy, the average differ- 
ence in cluster size is about 8 and the average stan- 
dard deviation of these differences across repeated 
simulations is approximately 6. 

Potential outcomes were simulated such that for 
a given pair k, with cluster j = 1 as the control, and 
cluster j = 2 as the treated, 1 

(4) Y. lk (0)~N(fi ,a 2 ), 

(5) Y 2k (0)=Y lk (0) + 8 k with4~iV(0,7r 2 ag). 



1 After submitting a draft of this comment, IKN asked 
for our code and upon reviewing confirmed an error in our 
original simulation setup; randomization had not been im- 
posed. Given the necessary restrictions on iterative revisions 
in this discussion setting we will not update our comment 
to incorporate the corrected results; however we were per- 
mitted to change the description of the simulations to reflect 
what was actually run (that is, with the error included). We 
have, however, verified that after correcting our simulations 
(by imposing randomization such that both potential out- 
comes and our covariate are independent of the treatment, 
the situation described in the rest of the discussion), our orig- 
inal conclusions remain; the results are extremely similar to 
those presented here. More details and the code appear in our 
online appendix at https://files.nyu.edu/jlhl7/public/ 
stat . sci . appendices/. 

The error did spark an interesting additional discussion 
about the potential problems with adjusting for covariates 
when randomization has failed which IKN explore in their 
rejoinder. 



Therefore 5 k serves the role of creating imbalance 
in SATE across clusters (IKN do not allow for this 
in their simulations). As tt grows we move from a 
situation with perfect balance to a situation in which 
we may as well have randomly chosen pair matches. 

Treatment effects were either kept constant across 
pairs at Tj k = 3.2 for all j and k or were allowed 
to vary. Heterogeneous treatment effects were gen- 
erated using a nonlinear deterministic function of 
the cluster potential outcome under treatment such 
that Tj k = 30/Y.j k (0). This creates a partial ceiling 
effect in which larger baseline values are associated 
with smaller treatment effects and as such the dis- 
tribution for both Y(l) and Tj k are quite skewed 
(again mimicking the IKN example). 2 The mean of 
Tj k across j and k is about 3.2 on average under this 
formulation. 

Individual-level observations are generated from 
these cluster potential outcomes by adding random 
errors, 

(6) Y ijk (-) = Y. jk (-) + e ijk , 

with €ij k ~ N(0, cr 2 ). We chose fio = 10, of = 1 and 
do = 4 for all simulations. 

4. SIMULATION RESULTS 

In Figure 1, we plot the standard error associated 
with our three estimates of the common treatment 
effect when cluster size is not perfectly matched (the 
scenario in which cluster sizes are equal is nearly 
identical, with minor differences noted below). Panel 
A displays the results in the scenario when treat- 
ment effects are constant (ignore the thick grey line 
at this point in the discussion). When tt = 0, match 
quality is perfect, and multilevel model estimators 
have the same precision. However, as the match qual- 
ity degrades (as represented by increasing levels of 
tt) the lines diverge rapidly with MLM2 reflecting 
increasingly higher levels of uncertainty. 

We might think that MLM1 is the "correct" model 
in this simulation scenario — after all, the treatment 
is constant. However, in terms of heterogeneity, there 
is no identifiable difference between poor matches 



2 We were able to obtain data from IKN on the distributon 
of pair specific differences in means in the data they used as 
a starting point for their simulation. We satisfied ourselves 
that the distribution in our simulations of the same quantity 
is even more skewed, thus clearly violates the assumption of 
normality of the treatment effects built in to our multilevel 
model. 
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Panel A. Unequal Cluster Size Panel B. Unequal Cluster Size 

Common Treatment Effect Variable Treatment Effect 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

Match Quality: perturbation s.d. Match Quality: perturbation s.d. 

as fraction of between pair s.d.. as fraction of between pair s.d.. 



Fig. 1. Plots that display how the standard error for each method varies with increasing disparity in the matches, as measured 
by ir. The left panel displays results from the scenario with constant treatment effects; the right panel displays results from the 
scenario with treatment effect heterogeneity. 



and variable treatment effects. So in the likely realm 
of imperfect matches, which model do we prefer, 
and why? Model selection techniques such as LRTs 
will guide us toward models that capture variation, 
when it is present, and we find that when tt > 0.13, 
the null that = is rejected at the 0.05 level. We 
represent this shift away from MLM1 by slowly grey- 
ing out its standard error in panel A. Thus models 
can provide evidence of either imperfect matching 
or variable treatment effects, but with this design, 
cannot adjudicate between the two. Of course, to the 
extent that we can use covariates to sufficiently im- 
prove across-cluster equivalence within pairs (that 
is, to make up for imbalance remaining after match- 
ing), we might have more confidence in using such 
tests to infer that the treatment effect is constant. 

What's interesting to note is that the IKN upper 
bound variance estimator for SATE closely mimicks 
the uncertainty estimate from MLM2. When clus- 
ter size is equal, the IKN estimator's precision is 
nearly coincident with that of MLM2 (not shown). 
Of course, neither the IKN estimator nor MLM2 can 
distinguish between true treatment effects and pair 
mismatches. 



In Figure 1, Panel B, we plot the standard error 
associated with two of our three estimates of the 
common treatment effect when cluster size is not 
perfectly matched and treatment effects vary non- 
linearly as specified above. Again, ignore the thick 
grey line at this point in the discussion. When tt = 0, 
there is already a difference in precision between 
MLM1 and MLM2. MLM1 completely ignores any 
heterogeneity, so with this incorrect assumption, it 
underestimates the uncertainty. Given that LRTs 
would correctly suggest that MLM1 is insufficient, 
in other words, o\ > 0, we do not include MLMl's 
precision in the plot. 

We now concentrate on MLM2 and the IKN esti- 
mator, and again we see that the precision follows 
a comparable trend as it is increased and matches 
degrade. Imperfect matches and the varying treat- 
ment effects are increasingly confounded, and the 
uncertainty concomitantly increases. It is somewhat 
surprising that the precision curves degrade at a 
slower rate than those in Panel A, yielding supe- 
rior precision when tt > 0.5. This can be attributed, 
however, to the additional information contained in 
the correlation between treatment and pair effects, 
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created by the nonlinear transformation that gener- 
ates Tjk- We confirmed this using a different simu- 
lation setup for which variable treatment and pair 
effects were generated independently (not shown); in 
these simulations, the precision degrades a bit more 
quickly than in the case of common treatment effect, 
as expected. That MLM2 and the IKN estimator's 
s.e. are at higher levels in Panel B when tt < 0.5 is 
simply the effect of increased baseline variation in 
treatment effects introduced in the simulations with 
variable Tjk- 

An equally important point here is that even though 
we in essence incorrectly model the skewed treat- 
ment effects by pretending they are normally dis- 
tributed, this "model failure" did not introduce bias 
or reduce precision (skew may have induced slight 
overestimation of the variance in the treatment). 

4.1 Covariates 

We operationalize covariates as having partial in- 
formation on Y.jk(0), but not directly on Tjk- The 
simplest way to do this, in our simulations, is to set 
Xjk = Y.jk(0) + (j, where Q ~ N(0,a^) is a noise 
process that limits our ability to recover the level of 
the potential outcome. When a 2 is small, we should 
eliminate, or nearly eliminate the variation between 
pairs, <7q. This should result in increased precision 
for the treatment effect, particularly when treat- 
ment and within pair differences are greatly con- 
founded. In the simulation results shown in Figure 
1, we chose a 2 = 0.2 2 , which is large enough to ob- 
scure some information in the covariate, but not so 
large as to render it nonsignificant, and fit the model 
given in (3) above. The standard errors for treat- 
ment effects (our primary assessment) are presented 



as a grey line which is remarkably constant across 
various levels of match quality. Panels A and B are 
quite similar, so our remarks apply to either. When 
match quality is very good, conditioning on covari- 
ates actually adds a small amount of uncertainty to 
the treatment estimate. However, the payoffs asso- 
ciated with covariates include: dramatic reduction 
of between pair variance a 2 a , which provides the op- 
portunity to identify a simpler (common treatment) 
model, when this actually is the case, and improved 
precision when match quality is poor. To summarize, 
the impact of a (significant) covariate or set of co- 
variates should be to decrease the variance cr 2 , and 
this has the potential to yield remarkable precision 
gains. 

5. CONCLUSION 

In some ways, the IKN framework is actually quite 
similar to the multilevel framework that allows for 
variation in treatment effects across pairs. The ad- 
vantage of the multilevel framework however is in 
moving beyond the simplest scenario to incorpo- 
rate additional complexity for greater precision or 
greater understanding. We have illustrated only a 
small number of the potential set of such model ex- 
pansions. 

REFERENCES 

Gelman, A. and Hill, J. (2007). Data Analysis Using Re- 
gression and Multilevel/Hierarchical Models. Cambridge 
Univ. Press, Cambridge. 

R Development Core Team (2008). R: A Language and 
Environment for Statistical Computing. R Foundation for 
Statistical Computing, Vienna, Austria. ISBN 3-900051- 
07-0. Available at http://www.R-project.org. 



